Add RFC2047-compliant MIME text decoder #9313

dmsnell · 2025-07-23T18:15:48Z

Status

Please feel free to ignore this for now.

Description

The existing wp_iso_descrambler() was added in 2004 because certain email subjects were appearing with funny-looking string spans. The following note was left as a comment:

this may only work with iso-8859-1, I'm afraid

But even so, it’s only likely to truly work with US-ASCII, which is rare to find in such a MIME-encoded string. In 2004 it might have been more common for PHP systems to operate on ISO-8859-1 (latin1) as their default, but today UTF-8 is the predominant encoding. Because the function returns the decoded bytes exactly as they were encoded, without converting them, it fails to perform its main function, which is to translate non-ASCII encodings.

The above image illustrates how the bytes print as an invalid UTF-8 sequence in trunk after decoding. The 0x80 byte was chosen for this demonstration because in latin1 it’s a control character, in cp1252 and in HTML it’s remapped to the Euro sign, and in UTF-8 it’s an invalid sequence.
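The three readings of the 0x80 byte can be checked directly. The following is a minimal sketch, not code from this patch, and it assumes the mbstring extension is loaded.

```php
<?php
// Illustration only; assumes the mbstring extension is available.
$byte = "\x80";

// Read as ISO-8859-1 (latin1), 0x80 is the C1 control character U+0080.
$from_latin1 = mb_convert_encoding( $byte, 'UTF-8', 'ISO-8859-1' );

// Read as Windows-1252 (cp1252), 0x80 is the Euro sign U+20AC.
$from_cp1252 = mb_convert_encoding( $byte, 'UTF-8', 'Windows-1252' );

// Read as UTF-8, a lone 0x80 is a continuation byte without a lead byte, so it is invalid.
$is_valid_utf8 = mb_check_encoding( $byte, 'UTF-8' );

// Hex dumps make the difference visible regardless of the terminal's encoding.
var_dump( bin2hex( $from_latin1 ), bin2hex( $from_cp1252 ), $is_valid_utf8 );
```

The same byte therefore yields two different UTF-8 strings depending on the source charset, and no valid string at all when passed through unconverted.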

Without additional conversion, calling code has to know which encoding the running PHP system uses and which other code will perform re-encoding. It’s likely to mess up. Worse, if the encoding is not ISO-8859-1 (latin1) then the decoding is wrong for all character sets.

var_dump( wp_iso_descrambler( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(4) "��d�"

This patch implements a compliant RFC2047 MIME text decoder, and decodes the text into UTF-8. Decoding into a single encoding normalizes the output and gives calling code the freedom to change the encoding if it wants without needing to make any assumptions or inquire about what it gets.

var_dump( rfc2047_decode( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(7) "Łódź"

With the same input as above we can see that the default output is now converted from the indicated input encoding. In this example, that decodes to a control character in UTF-8 but that is authentic to the given input. The re-encodings are now invalid because the returned data is already in UTF-8.

Supported encodings

This implementation attempts to support as many encodings as are practical based on the availability of decoding logic on the running server.

If mb_convert_encoding() is available it will be preferred, followed by iconv(), followed by direct conversion from US-ASCII or UTF-8 byte streams. Nuances and peculiarities of the PHP text-encoding functions are left as artifacts of PHP and not addressed in this function.
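The preference order described above could be sketched roughly as follows. The helper name `sketch_to_utf8()` and its null-on-failure return are assumptions made for illustration; they are not the patch's actual code or signature.

```php
<?php
/**
 * Hypothetical helper, not the function from this patch: converts $bytes from
 * $charset into UTF-8, returning null when no available converter handles it.
 */
function sketch_to_utf8( string $bytes, string $charset ): ?string {
	// 1. Prefer mb_convert_encoding() when mbstring is loaded.
	if ( function_exists( 'mb_convert_encoding' ) ) {
		try {
			$converted = @mb_convert_encoding( $bytes, 'UTF-8', $charset );
			if ( is_string( $converted ) ) {
				return $converted;
			}
		} catch ( \ValueError $e ) {
			// PHP 8 throws for an unknown encoding name; fall through to iconv().
		}
	}

	// 2. Fall back to iconv(), which returns false on failure.
	if ( function_exists( 'iconv' ) ) {
		$converted = @iconv( $charset, 'UTF-8', $bytes );
		if ( is_string( $converted ) ) {
			return $converted;
		}
	}

	// 3. Last resort: accept bytes that are already US-ASCII or valid UTF-8.
	$name = strtoupper( $charset );
	if ( 'US-ASCII' === $name && 1 !== preg_match( '/[\x80-\xFF]/', $bytes ) ) {
		return $bytes;
	}
	if ( 'UTF-8' === $name && 1 === preg_match( '//u', $bytes ) ) {
		return $bytes;
	}

	return null;
}

// The ISO-8859-2 bytes from the example above.
var_dump( sketch_to_utf8( "\xA3\xF3" . 'd' . "\xBC", 'ISO-8859-2' ) );
```

Whatever substitution or leniency mb_convert_encoding() and iconv() apply to malformed input is passed through unchanged in this sketch, in keeping with the note that those peculiarities are left to PHP.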

Error handling

Unfortunately, even where iconv_mime_decode() is available, its error-handling options are limited and unclear. By implementing the decoder in user-space, the error cases can be handled explicitly, and this implementation provides configurable error handling:

By default, invalid encoded words are preserved as unencoded plain text. This corresponds to the preserve-errors flag. The input text will appear in the output and look jumbled, but perhaps a human can make sense of the data in it. This is how most decoders handle errors.
Passing in replace-errors will remove the entire encoded word and replace it with the replacement character U+FFFD �. This discards information from the input, but leaves a placemarker indicating that it was there before.
Passing in bail-on-error will cause the function to return early and return null, effectively the same as the strict mode in other decoders.
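To illustrate how the three modes differ, here is a minimal sketch. The flag names come from the description above, but the helper `sketch_apply_error_mode()`, its segment array shape, and the way the flag is passed are assumptions for illustration only.

```php
<?php
/**
 * Hypothetical helper: joins already-processed segments into the output.
 * Each segment has 'raw' (the original text) and 'decoded' (UTF-8 text, or
 * null when the encoded word could not be decoded).
 */
function sketch_apply_error_mode( array $segments, string $on_error = 'preserve-errors' ): ?string {
	$output = '';

	foreach ( $segments as $segment ) {
		if ( null !== $segment['decoded'] ) {
			$output .= $segment['decoded'];
			continue;
		}

		switch ( $on_error ) {
			case 'bail-on-error':
				// Stop at the first failure and report it to the caller.
				return null;

			case 'replace-errors':
				// Drop the whole encoded word and leave U+FFFD in its place.
				$output .= "\u{FFFD}";
				break;

			default:
				// 'preserve-errors': keep the undecoded input text as-is.
				$output .= $segment['raw'];
		}
	}

	return $output;
}

$segments = array(
	array( 'raw' => 'Hi ', 'decoded' => 'Hi ' ),
	array( 'raw' => '=?UTF-8?Q?=6f?=', 'decoded' => null ), // Lower-case hex digit: invalid.
);

$preserved = sketch_apply_error_mode( $segments );
$replaced  = sketch_apply_error_mode( $segments, 'replace-errors' );
$bailed    = sketch_apply_error_mode( $segments, 'bail-on-error' );
```

With this input, `$preserved` keeps the raw encoded word after "Hi ", `$replaced` ends in a single U+FFFD, and `$bailed` is null.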

There are multiple classes of potential errors and error behavior is not defined in the RFC. This implementation treats all classes in the same way, except for the rule that encoded words must be 75 characters or shorter (as this rule was clearly intended for encoders to make the job of decoding simpler, but otherwise does not speak to the well-formedness of the encoding).

Unsupported character sets.
Invalid encodings (B and Q are supported).
Invalid byte sequences in the quoted-printable encoding, such as =. or =6f (only upper-case hex digits are allowed).
Invalid base64-decoding in the binary encoding.
Invalid character re-encoding on the decoded byte stream.
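As an example of one of these classes, a strict check of the quoted-printable ("Q") payload might look like the sketch below. The helper `sketch_decode_q()` is hypothetical and only shows the upper-case hex rule mentioned above; it is not the patch's implementation.

```php
<?php
/**
 * Hypothetical helper: decodes the payload of a "Q" encoded word into raw
 * bytes, or returns null on an invalid escape such as "=." or "=6f".
 */
function sketch_decode_q( string $payload ): ?string {
	$bytes  = '';
	$length = strlen( $payload );

	for ( $i = 0; $i < $length; $i++ ) {
		$char = $payload[ $i ];

		// In the Q encoding an underscore stands for a space (0x20).
		if ( '_' === $char ) {
			$bytes .= ' ';
			continue;
		}

		// Anything other than "=" is copied through unchanged.
		if ( '=' !== $char ) {
			$bytes .= $char;
			continue;
		}

		// "=" must be followed by exactly two upper-case hex digits.
		$hex = substr( $payload, $i + 1, 2 );
		if ( 1 !== preg_match( '/^[0-9A-F]{2}\z/', $hex ) ) {
			return null;
		}

		$bytes .= chr( hexdec( $hex ) );
		$i     += 2;
	}

	return $bytes;
}

var_dump( bin2hex( sketch_decode_q( '=A3=F3d=BC' ) ) );
var_dump( sketch_decode_q( '=6f' ) );
```

The result is still a byte string in the charset named by the encoded word, so it would need converting to UTF-8 afterwards.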

Of note, the RFC implies no possible syntax errors. Instead, anything which appears as a syntax error indicates that the span of text which looks like an encoded word is actually just plain text and the parser will skip over it to look for the next well-formed encoded word.
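The "no syntax errors" reading can be sketched with a scan that only recognizes complete encoded words and leaves everything else alone. The helper `sketch_find_encoded_words()` and its pattern are assumptions: the pattern is a simplified approximation of the RFC grammar and not the parser used in this patch.

```php
<?php
/**
 * Hypothetical helper: lists spans that match a simplified encoded-word
 * shape, "=?charset?encoding?text?=". Anything that does not match is
 * treated as plain text and is simply not reported.
 */
function sketch_find_encoded_words( string $text ): array {
	// Simplified: each part is one or more characters other than "?" and whitespace.
	$pattern = '/=\?([^?\s]+)\?([^?\s]+)\?([^?\s]+)\?=/';
	preg_match_all( $pattern, $text, $matches, PREG_SET_ORDER );

	$words = array();
	foreach ( $matches as $match ) {
		$words[] = array(
			'span'     => $match[0],
			'charset'  => $match[1],
			'encoding' => $match[2],
			'text'     => $match[3],
		);
	}

	return $words;
}

// The second span lacks its closing "?=", so it is not an encoded word at all.
$found = sketch_find_encoded_words( 'Re: =?UTF-8?Q?caf=C3=A9?= and =?UTF-8?Q?broken' );
```

Only the first span is reported here; the unterminated one stays in the output as ordinary text instead of triggering any error mode.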

Notes

Questions arise around unspecified failure behaviors.

What if the syntax is obviously supposed to be an encoding but technically isn’t? For example, it’s missing a closing '?'. It may be computationally heavy to _guess_ whether something is broken syntax, so for some failures it is ambiguous whether they should copy the input plaintext or return null.
What do other high-quality libraries do with errors?

github-actions · 2025-07-23T18:29:35Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

dmsnell force-pushed the add/mime-decoder branch 2 times, most recently from 42a1358 to eca673c Compare August 12, 2025 20:11

dmsnell force-pushed the add/mime-decoder branch 5 times, most recently from ae3f2bd to 4b4ef54 Compare September 16, 2025 12:43

dmsnell force-pushed the add/mime-decoder branch 6 times, most recently from f09a528 to 69fc308 Compare September 25, 2025 18:47

dmsnell force-pushed the add/mime-decoder branch 5 times, most recently from 4965e92 to 3efffd0 Compare October 6, 2025 21:17

dmsnell force-pushed the add/mime-decoder branch 4 times, most recently from a3fdc53 to e974fdf Compare October 9, 2025 23:39

dmsnell force-pushed the add/mime-decoder branch 3 times, most recently from 0c0f4e7 to 0d97c25 Compare October 21, 2025 09:22

dmsnell mentioned this pull request Nov 6, 2025

Support rfc6530 #5237

Open

dmsnell force-pushed the add/mime-decoder branch from 0d97c25 to 3676c71 Compare November 24, 2025 20:25

dmsnell force-pushed the add/mime-decoder branch from 3676c71 to 39fb139 Compare December 18, 2025 03:51
